To evaluate the performance of the electronics store, this project will address the following questions:
Product Hierarchy: Which product categories contribute most to the total revenue, and does a higher unit price correlate with higher total sales volume?
Hypothesis: While premium items (e.g., MacBooks) generate high revenue per sale, mid-tier accessories (e.g., USB-C Charging Cables) will have the highest frequency of transactions, driving consistent cash flow.
Temporal & Geographic Patterns: Are there specific “peak hours” for electronics shopping that differ across major US cities?
Hypothesis: Tech-heavy hubs like San Francisco will show higher purchase volumes during late-evening hours compared to other metropolitan areas.
Dataset Description
The dataset used in this analysis represents 12 months of sales data from a multi-city electronics retailer.
Source: The data is sourced from the Kaggle Electronics Store Dataset.
Size: It contains approximately 186,000 rows and 6 initial columns.
Content: It includes variables such as Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address.
Relevance
This dataset is critical for answering my research questions because it provides the granularity required for both temporal (time-based) and categorical analysis. The inclusion of timestamped orders allows for the application of lubridate to extract seasonal trends, while the product-level pricing enables a deep dive into revenue distribution using dplyr’s .by argument.
Analytical Roadmap
Exploration: Use skimr to identify data distributions and missingness.
Cleaning: Parse dates, split addresses into cities, and handle numeric conversions.
Analysis: Filter for high-value transactions and use grouped summaries to compare city-level performance.
Visualization: Create a dashboard of sales trends using patchwork and plotly to identify peak periods visually.
Loading Libraries and Data
# Loading necessary libraries for the project
library(tidyverse)
library(plotly)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(DT)
library(skimr)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(knitr)
# Installing webshot2 so interactive visuals render in the PDF output.
# A CRAN mirror must be set explicitly in non-interactive sessions;
# omitting it raises "trying to use CRAN without setting a mirror".
install.packages("webshot2", repos = "https://cloud.r-project.org")
# Installing tinytex for PDF export (same CRAN mirror fix as above)
install.packages("tinytex", repos = "https://cloud.r-project.org")
# An existing LaTeX distribution was found at /usr/local/bin/tlmgr, so
# install_tinytex() refuses to run by default; force = TRUE overrides the check.
tinytex::install_tinytex(force = TRUE)
Why these libraries?
tidyverse: Provides a cohesive set of tools for data manipulation (dplyr), visualization (ggplot2), and data tidying (tidyr). Its consistent syntax allows for efficient data wrangling and analysis.
lubridate: Simplifies the parsing and manipulation of date-time data, which is crucial for analyzing temporal patterns in sales.
patchwork: Enables the combination of multiple ggplot2 plots into a single cohesive dashboard, facilitating comparative analysis across different dimensions.
plotly: Allows for the creation of interactive visualizations that can enhance user engagement and provide deeper insights through hover information and dynamic filtering.
DT: Provides an interactive interface for displaying data tables, making it easier to explore and understand the dataset.
skimr: Offers a comprehensive overview of the dataset, including distributions, missing values, and summary statistics, which is essential for the initial data exploration phase.
scales: Useful for formatting axes and labels, particularly when dealing with financial data, to improve readability and presentation quality.
knitr: Powers the report itself by weaving R code, output, and narrative into a single reproducible document, and formats static summary tables via kable().
Section 2: Data Exploration
In this section, I import the raw sales data and perform an initial inspection to understand the variable distributions and identify potential data quality issues.
Merging the Monthly Datasets
I will consolidate the 12 monthly CSV files into a single primary data frame using a functional programming approach. During the import phase, I will use lubridate to parse the Order Date and create a dedicated Month identifier. This consolidation is essential for performing longitudinal analysis and ensures that cross-month trends—such as seasonal spikes—are captured accurately in a single, unified tibble.
Note
I uploaded the raw dataset to GitHub.
# Downloading the raw data from GitHub (only runs if the folder is absent)
if (!dir.exists("Monthly_Sales_Data")) {
  dir.create("Monthly_Sales_Data")
  data_url <- "https://github.com/Hein96/Electronic-Sales-Data.git"
  dest_file <- "Monthly_Sales_Data.zip"
  download.file(data_url, dest_file, mode = "wb")
  unzip(dest_file, exdir = "Monthly_Sales_Data")
}
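The consolidation step described above (the functional merge of the 12 monthly CSVs) is not visible in the rendered output, so here is a minimal sketch of what it could look like; the folder path and file pattern are assumptions, while the sales_raw name matches the object inspected later in this section.

```r
library(tidyverse)

# Read every monthly CSV in the extracted folder and stack them into a single
# tibble. The path and "\\.csv$" pattern are assumptions; adjust to the actual
# layout of the unzipped repository.
sales_raw <- list.files(
  "Monthly_Sales_Data",
  pattern = "\\.csv$",
  full.names = TRUE,
  recursive = TRUE
) |>
  map(read_csv) |>
  list_rbind()
```

list_rbind() (purrr >= 1.0) keeps the column types consistent across the twelve files while binding them row-wise into one unified tibble.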
Rows: 18383 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Order ID, Product, Quantity Ordered, Price Each, Order Date, Purcha...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The remaining eleven monthly files load with the same column specification (6 character columns, comma-delimited). Their row counts are 12,011; 25,117; 12,036; 9,723; 14,371; 13,622; 15,226; 16,635; 17,661; 20,379; and 11,686.
glimpse(sales_raw) # Get a quick overview of the data structure
#| summary: true
summary(sales_raw) # Get summary statistics for each column
Order ID Product Quantity Ordered Price Each
Length:186850 Length:186850 Length:186850 Length:186850
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Order Date Purchase Address
Length:186850 Length:186850
Class :character Class :character
Mode :character Mode :character
skimr::skim(sales_raw) # Get a comprehensive overview of the dataset
I am encountering NA values because the Price Each column contains non-numeric entries (e.g., the repeated "Price Each" header rows). To address this, I will convert the column to numeric and handle the non-numeric entries appropriately.
sales_raw <- sales_raw |>
  mutate(`Price Each` = as.numeric(`Price Each`)) # Convert to numeric, coercing non-numeric values to NA
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Price Each = as.numeric(`Price Each`)`.
Caused by warning:
! NAs introduced by coercion
Now that the Price Each column has been converted to numeric, we can see the summary statistics without NA values affecting the calculations. The mean, median, and standard deviation of the unit price are now accurately represented, allowing for a better understanding of the pricing distribution in the dataset.
Note
Technical Choice: Data Consolidation While the raw data consists of 12 individual monthly files, I have chosen to perform the primary exploration on the consolidated dataset. This allows for a holistic view of the store’s annual performance and ensures that the summary statistics (mean, median, etc.) reflect the entire fiscal year rather than a single month.
Initial Observations & Anomalies
Based on the exploration above, I have identified several patterns and issues that must be addressed in the cleaning phase:
Data Type Mismatch: The Order Date is currently stored as a character string and needs conversion to a POSIXct object.
Missing Values: Preliminary checks via skim() indicate a small percentage of empty rows that need to be removed.
Column Naming: Several column names contain spaces (e.g., Order ID), which makes coding cumbersome. These will be renamed for consistency.
Logical Anomalies: Quantity Ordered is currently stored as character text; it must be coerced to numeric to allow for revenue calculation.
Section 3: Data Cleaning
Initial Integrity Check
Before cleaning, I calculate the total number of missing values to understand the extent of data loss.
sum(is.na(sales_raw))
[1] 3625
The Cleaning Pipeline
I used a single piped workflow to clean the data. I chose to drop rows with missing Order IDs rather than imputing them, as an Order ID is a unique identifier that cannot be accurately guessed. For prices, I used grouped median imputation to maintain data integrity.
sales_cleaned <- sales_raw |>
  # 1. Standardizing column names
  rename(
    order_id = `Order ID`,
    product = `Product`,
    quantity = `Quantity Ordered`,
    unit_price = `Price Each`,
    order_date = `Order Date`,
    purchase_address = `Purchase Address`
  ) |>
  # 2. Filter out the repeating header rows
  filter(product != "Product") |>
  # 3. Fix data types and parse dates
  mutate(
    quantity = as.numeric(quantity),
    unit_price = as.numeric(unit_price),
    order_date = mdy_hm(order_date) # Parsing with lubridate
  ) |>
  # 4. String cleaning: extract the city
  # (the text between the two commas in the address)
  mutate(
    city = str_split_i(purchase_address, ",", 2) |> str_trim()
  ) |>
  # 5. Grouped imputation: fill missing prices with the product's median price
  mutate(
    unit_price = if_else(
      is.na(unit_price),
      median(unit_price, na.rm = TRUE),
      unit_price
    ),
    .by = product
  ) |>
  # 6. Drop rows with missing order IDs or quantities
  drop_na(order_id, quantity)
Handling Headers: I used filter(product != "Product") because merging the 12 monthly files duplicates the header row 11 times. Leaving these rows in would cause errors during numeric conversion.
Imputation Strategy: I chose grouped median imputation for prices because electronics have fixed prices per SKU. A mean might be skewed by outliers, whereas the median for a specific product like “iPhone” is highly accurate.
Date Formatting: I converted order_date to a POSIXct object immediately to allow for time-series analysis.
Section 4: Data Analysis
In this section, I perform a series of analytical operations to extract business insights from the consolidated electronics sales data.
Product Discovery
I will identify the most expensive items in the inventory to see where the “Premium” tier naturally begins.
# 1. List unique products and their prices, sorted by price
# I use .by to ensure we get the correct price for each item
high_value_products <- sales_cleaned |>
  summarise(unit_price = max(unit_price), .by = product) |>
  arrange(desc(unit_price))

# 2. Display the top 10 most expensive products
# This justifies why we will later use the $600 threshold
high_value_products |> head(10)
# A tibble: 10 × 2
product unit_price
<chr> <dbl>
1 Macbook Pro Laptop 1700
2 ThinkPad Laptop 1000.
3 iPhone 700
4 Google Phone 600
5 LG Dryer 600
6 LG Washing Machine 600
7 Vareebadd Phone 400
8 27in 4K Gaming Monitor 390.
9 34in Ultrawide Monitor 380.
10 Flatscreen TV 300
Note
I have filtered for high-value products, and apparently the MacBook, ThinkPad, and iPhone are the most expensive; all three are priced above $600. Focusing on these products is useful for examining premium sales dynamics, such as revenue concentration and customer willingness to pay.
Global High-Value Analysis
Now that I have confirmed which products are expensive (MacBooks, iPhones, and ThinkPads), I can filter for all high-value products priced above $600, regardless of brand. This global approach ensures the analysis captures all high-end electronics (such as ThinkPads and high-end monitors) rather than focusing on a single brand. I will assign each row to a segment based on price (>$600 for "Premium", >$150 for "Mid-Range", and <=$150 for "Budget"), then calculate the total revenue and order volume for each segment to see which tier drives the store's financial performance.
# Numeric filter for all products > $600
# This creates a 'Premium Segment' dataset
high_value_segment <- sales_cleaned |>
  filter(unit_price > 600)

# Use case_when() to categorize the WHOLE dataset
# This adds context to every row (Above & Beyond categorization)
sales_analysis <- sales_cleaned |>
  mutate(
    month = month(order_date, label = TRUE, abbr = TRUE), # Extract the month name from order_date
    market_segment = case_when(
      unit_price > 600 ~ "Premium",
      unit_price > 150 ~ "Mid-Range",
      unit_price <= 150 ~ "Budget"
    )
  )

# Summary of sales by market segment
segment_summary <- sales_analysis |>
  summarise(
    total_revenue = sum(quantity * unit_price),
    avg_order_size = mean(quantity),
    transaction_count = n(),
    .by = market_segment
  ) |>
  arrange(desc(total_revenue))
Next, I create an order_type variable to separate normal purchases from "Multi-Unit" bulk orders, which is important for understanding buying patterns.
# 1. select() with helper functions
# I select the ID and any column related to the 'order' or 'product'
refined_view <- sales_analysis |>
  select(order_id, starts_with("order"), contains("product"))

# 2. mutate() with if_else()
# Creating a binary flag to identify "Multi-Unit" orders
sales_analysis <- sales_analysis |>
  mutate(order_type = if_else(quantity > 1, "Multi-Unit", "Single-Unit"))

# 3. arrange() in ascending order
# Checking our lowest-priced items first
budget_check <- sales_analysis |>
  arrange(unit_price) |>
  head(10)
I use a pivot operation to transform the data from a long format into a wide, readable matrix. This cross-tabulation provides a clear comparative view of how revenue is distributed across different cities and market segments simultaneously. This view is particularly useful for identifying which geographic regions have the highest concentration of “Premium” shoppers.
# Pivot operation: transform the data from 'long' to 'wide' format for a summary report
city_segment_matrix <- sales_analysis |>
  summarise(revenue = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  pivot_wider(names_from = market_segment, values_from = revenue)

# Display the result
city_segment_matrix
# A tibble: 9 × 4
city Budget `Mid-Range` Premium
<chr> <dbl> <dbl> <dbl>
1 Dallas 497339. 912940. 1357697.
2 Boston 668830. 1204417. 1788396.
3 Los Angeles 1011927. 1740350. 2700294.
4 San Francisco 1534063. 2671251. 4056890.
5 Seattle 501730. 925629. 1320397.
6 Atlanta 509482. 903920. 1382096.
7 New York City 859720. 1511003. 2293594.
8 Portland 435602. 722992. 1161897.
9 Austin 341530. 577454. 900598.
Advanced Filtering & Top/Bottom Slicing
To demonstrate precise data extraction, I used string matching and set membership (%in%) to isolate premium “Pro” models in specific tech-heavy cities. I also utilized slice_max() to identify the top 5 most profitable transactions, providing a snapshot of the store’s most successful individual sales.
# Identifying high-end 'Pro' or 'ThinkPad' models in key tech hubs
tech_hub_premium <- sales_analysis |>
  filter(
    city %in% c("San Francisco", "Seattle", "Austin"),
    str_detect(product, "Pro|ThinkPad")
  )

# slice_max(): the top 5 most profitable individual orders
top_5_orders <- sales_analysis |>
  mutate(revenue = quantity * unit_price) |>
  slice_max(revenue, n = 5)
I implemented window functions including cumsum() and lag() to analyze the store’s growth trajectory. By calculating the cumulative revenue and the daily change in sales, I can visualize the store’s momentum and identify specific days where revenue significantly outperformed the previous period.
# Calculating daily revenue growth and comparing to the previous day
daily_growth <- sales_analysis |>
  # Collapse the timestamp to a calendar date so the revenue is truly daily
  # (grouping on the raw POSIXct would group by the minute instead)
  mutate(order_date = as_date(order_date)) |>
  summarise(daily_rev = sum(quantity * unit_price), .by = order_date) |>
  arrange(order_date) |>
  mutate(
    cumulative_revenue = cumsum(daily_rev),
    prev_day_rev = lag(daily_rev),
    rev_change = daily_rev - prev_day_rev
  )
Finally, I performed a correlation analysis to examine the relationship between price and quantity. This helps determine if higher price points lead to a significant drop in the number of units sold. I also used a cross-tabulation to count how order types (Single vs. Multi-Unit) are distributed across the different market segments.
# Checking whether people buy fewer items when the price is higher
price_quantity_corr <- sales_analysis |>
  summarise(correlation = cor(unit_price, quantity))

# Breakdown of order types by market segment
segment_counts <- sales_analysis |>
  count(market_segment, order_type)
The calculated correlation between unit price and quantity is -0.148. While this confirms a negative relationship, supporting the law of demand, the correlation is weak. This is primarily because consumer electronics are discrete purchases: whether a customer is buying a budget charging cable or a premium laptop, the modal (most common) quantity is one unit.
Section 5: Visualization
Distribution of Pricing (Boxplot)
Due to the significant skewness in product pricing—ranging from $2.99 batteries to $1,700 MacBooks—a standard linear histogram hides the distribution of premium goods. I have chosen a Boxplot to more effectively visualize the volume of high-value transactions relative to budget accessories.
# 1. Boxplot
p1 <- ggplot(sales_analysis, aes(x = market_segment, y = unit_price, fill = market_segment)) +
  geom_boxplot(alpha = 0.7) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Price Distribution by Market Segment",
    subtitle = "Visualizing the spread and outliers across tiers",
    x = "Market Segment", y = "Unit Price"
  ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "mako")
p1
Figure 1: Distribution of unit prices across market segments.
Note
The Question: Is there a big difference between “Budget” and “Premium” prices?
The Finding: This chart shows that Budget items are all priced very closely together, while Premium items (like MacBooks) have a very wide price range. It proves our $600 “High-Value” rule was the right choice.
Revenue by City & Segment (Bivariate & Grouped)
This bar chart shows which cities generate the most revenue, broken down by market segment.
# 2. Bar chart
p2 <- sales_analysis |>
  summarise(total_rev = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  ggplot(aes(x = reorder(city, total_rev), y = total_rev, fill = market_segment)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = label_comma(accuracy = 1, scale = 1e-6, suffix = "M")(total_rev)),
    position = position_dodge(width = 0.9), vjust = -0.5, size = 3
  ) +
  scale_y_continuous(labels = label_dollar(scale = 1e-6, suffix = "M")) +
  scale_fill_viridis_d(option = "magma", end = 0.8) +
  labs(
    title = "Total Revenue by City and Market Segment",
    x = "City", y = "Total Revenue ($ Millions)", fill = "Segment"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2
Figure 2: Total revenue contribution by city and segment.
Note
The Question: Which cities spend the most on expensive electronics?
The Finding: Big tech cities like San Francisco and NYC are the clear leaders. They don’t just buy more items; they specifically buy more “Premium” items compared to smaller cities.
Correlation: Price vs. Quantity (Scatter Plot)
To visualize the -0.148 correlation calculated in the correlation analysis of Section 4, I used a scatter plot with a trend line.
# 3. Scatter plot
p3 <- ggplot(sales_analysis |> sample_n(5000), aes(x = unit_price, y = quantity)) +
  geom_jitter(alpha = 0.2, color = "#3498db", size = 1) + # Size/alpha mapping
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Trend line
  labs(
    title = "Correlation: Price vs. Quantity Ordered",
    subtitle = "Sample of 5,000 orders showing weak negative correlation",
    x = "Unit Price ($)", y = "Quantity"
  ) +
  theme_minimal()
p3
`geom_smooth()` using formula = 'y ~ x'
Figure 3: Correlation between unit price and quantity.
Note
The Question: Do people buy fewer items when the price goes up?
The Finding: Yes, but only slightly. The trend line goes down, which proves that people usually buy expensive laptops one at a time, but might buy several cheap cables at once.
Monthly Trends by Segment (Faceted Line Chart)
This visualization tracks the store’s performance over the year, faceted by the three market segments.
# 4. Line chart & 5. Faceting
p4 <- sales_analysis |>
  summarise(monthly_rev = sum(quantity * unit_price), .by = c(month, market_segment)) |>
  ggplot(aes(x = month, y = monthly_rev, group = market_segment, color = market_segment)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~market_segment, scales = "free_y") + # Faceted panels (Req 5.5)
  scale_y_continuous(labels = label_dollar(scale = 1e-3, suffix = "K")) +
  labs(
    title = "Revenue Trends Over Time",
    subtitle = "Faceted by Market Segment (axes are independent)",
    x = "Month", y = "Monthly Revenue ($ Thousands)"
  ) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90))
p4
Figure 4: Monthly Revenue Trends by Market Segment.
Note
The Question: Do sales change during different months for each segment?
The Finding: Budget items stay flat and steady all year. However, Premium items see a huge “spike” at the end of the year, likely due to holiday shopping.
Dashboard (Patchwork)
I combine the key insights into a unified dashboard for an executive summary.
# 6. Combined dashboard
library(patchwork)

dashboard <- (p1 + p3) / p2 +
  plot_annotation(
    title = "Electronics Store Annual Sales Dashboard",
    subtitle = "Analysis of 180k+ transactions from 2019",
    caption = "Data Source: Kaggle Electronics Sales Dataset"
  )
dashboard
`geom_smooth()` using formula = 'y ~ x'
Figure 5: Combined Dashboard
Seasonal Heatmap of Orders
# 7. Heatmap
p5 <- sales_analysis |>
  # First, create a 'day_of_week' column
  mutate(day_of_week = wday(order_date, label = TRUE)) |>
  count(month, day_of_week) |>
  ggplot(aes(x = month, y = day_of_week, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Order Volume Heatmap",
    subtitle = "Identifying peak shopping periods by day and month",
    x = "Month", y = "Day of Week", fill = "Orders"
  ) +
  theme_minimal()
p5
Figure 6: Seasonal Heatmap of Orders.
Note
The Question: What are the busiest days and months for orders?
The Finding: This map highlights “hot spots” in December. It tells a business manager exactly when they need more staff in the warehouse to handle the extra orders.
Price vs. Quantity by Segment (Multivariate Scatter)
Instead of a sample of 5,000, I will look at the mean relationship for every product, using color to show the segments.
# 8. Advanced scatter
p6 <- sales_analysis |>
  summarise(
    avg_price = mean(unit_price),
    avg_qty = mean(quantity),
    .by = c(product, market_segment)
  ) |>
  ggplot(aes(x = avg_price, y = avg_qty, color = market_segment)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_text(aes(label = product), check_overlap = TRUE, vjust = -1, size = 3) +
  scale_color_viridis_d() +
  labs(
    title = "Product Level: Price vs. Average Quantity",
    x = "Average Unit Price", y = "Average Quantity per Order"
  ) +
  theme_classic()
p6
Figure 7: Price vs. Quantity by Segment.
Note
The Question: Which specific products are the “Best Sellers” in each category?
The Finding: By looking at individual products, we can see which ones are expensive but still very popular. This helps us decide which items to put on the front page of the website.
Daily Revenue Volatility (Time Series)
Using the daily_growth data I created in Section 4, I can plot the daily change in revenue to show how "stable" the business is.
# 9. Line chart
p7 <- daily_growth |>
  ggplot(aes(x = order_date, y = rev_change)) +
  geom_line(color = "grey") +
  geom_point(aes(color = rev_change > 0), size = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Daily Revenue Volatility",
    subtitle = "Points above the red line indicate growth vs. the previous day",
    x = "Date", y = "Revenue Change ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
p7
Figure 8: Daily Revenue Volatility.
Note
The Question: How much does revenue change from one day to the next?
The Finding: This chart shows “volatility.” If the dots are far from the red line, it means sales are jumping up and down. This helps us see if our daily promotions are actually working.
Cumulative Growth Curve
This final plot shows the “Story of the Year”—how the store reached its total annual revenue.
The Question: How did the store’s total money grow over the whole year?
The Finding: This is the “Big Picture.” It shows that the store started slow but finished the year with massive growth, reaching its multi-million dollar goal by December.
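The plotting code for this cumulative curve did not survive rendering; a minimal sketch using the cumulative_revenue column already computed in daily_growth (the object name p8 and the styling choices are my own assumptions) could be:

```r
# 10. Cumulative growth curve (sketch; p8 is a hypothetical object name)
p8 <- daily_growth |>
  ggplot(aes(x = order_date, y = cumulative_revenue)) +
  geom_line(color = "#2c3e50", linewidth = 1) +
  scale_y_continuous(labels = label_dollar(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Cumulative Revenue Growth",
    subtitle = "The store's running total revenue across the year",
    x = "Date", y = "Cumulative Revenue ($ Millions)"
  ) +
  theme_minimal()
p8
```

Because cumulative_revenue is a running cumsum() of daily revenue, the curve is monotonically increasing, and its steepening slope toward December is what tells the year-end growth story.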
Section 6: Interactive Elements
Interactive Data Exploration (DT)
Instead of a static table, I have provided an interactive data browser. This allows you to filter by city, search for specific products, or sort by revenue to find hidden patterns.
Research Insight: Data Portability
The Question: How can managers use this data for their own reports?
The Finding: By adding CSV and Excel buttons, I turn a static report into a functional business tool. Stakeholders can filter for a specific city or product and instantly download that specific data subset.
Tip
I have implemented the dom = 'Bfrtip' configuration. In DT's dom string, B is the Buttons extension, f the filtering input, r the processing display, t the table, i the information summary, and p the pagination controls, so all of these elements load in a clean, professional layout.
Important
Exporting Filtered Data The download buttons are “reactive”. If you use the filters at the top of the columns to show only “Premium” items, clicking the Excel button will download only those filtered premium rows, not the entire dataset.
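The datatable() call itself is not visible in the extracted output; a sketch matching the configuration described above (the column selection is an assumption) might look like:

```r
library(DT)

# Interactive data browser with export buttons and per-column filters.
# The selected columns are an assumption; the dom/buttons settings follow
# the 'Bfrtip' configuration described in the text.
sales_analysis |>
  select(order_id, product, quantity, unit_price, city, market_segment) |>
  datatable(
    extensions = "Buttons",
    filter = "top", # column filters; exports respect the filtered view
    options = list(
      dom = "Bfrtip",
      buttons = c("csv", "excel"),
      pageLength = 10
    )
  )
```

With client-side processing (the default), the CSV and Excel buttons export exactly the rows that survive the active column filters, which is the "reactive" behavior noted above.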
Interactive City Performance (ggplotly)
I have converted the Section 5.2 bar chart into an interactive version. You can now hover over any bar to see the exact dollar amount and the specific market segment.
# Convert ggplot to ggplotly
library(plotly)

# Reusing the p2 summary but adding a 'text' aesthetic for custom hover info
p2_interactive <- sales_analysis |>
  summarise(total_rev = sum(quantity * unit_price), .by = c(city, market_segment)) |>
  mutate(
    hover_text = paste(
      "City:", city,
      "<br>Segment:", market_segment,
      "<br>Revenue: $", format(round(total_rev, 0), big.mark = ",")
    )
  ) |>
  ggplot(aes(x = reorder(city, total_rev), y = total_rev,
             fill = market_segment, text = hover_text)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  labs(title = "Hover to Explore Revenue by City", x = "City", y = "Total Revenue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to interactive
ggplotly(p2_interactive, tooltip = "text")
Figure 11: Interactive Bar Chart.
Note
Research Insight: Geographic Revenue
The Question: Which specific segment is the primary driver for our top-earning cities?
The Finding: Hovering over the bars reveals that in San Francisco, the “Premium” segment contributes more than double the revenue of the “Budget” and “Mid-Range” segments combined.
Tip
Navigation Tip You can double-click on a segment in the legend to isolate that specific tier across all cities, or click and drag to zoom into a specific group of cities.
Important
Comparative Analysis Note that the y-axis represents total revenue. While some bars look similar in height, the hover text provides the exact dollar precision needed for financial auditing.
Product Value Analysis (Native plot_ly)
This plot is built using “Native Plotly” logic. It visualizes the relationship between a product’s average price and its total revenue contribution.
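The plot_ly() call below refers to a table_data object whose construction was lost in rendering; a plausible reconstruction, with the grouping and column names inferred from the chart's aesthetics and hover text, is:

```r
# Reconstructing the summary the native plotly chart expects.
# Grouping and column names are inferred from the plot's x, y, color,
# and hover-text mappings; treat this as an assumption, not the original code.
table_data <- sales_analysis |>
  summarise(
    Avg_Price = mean(unit_price),
    Total_Revenue = sum(quantity * unit_price),
    .by = c(product, city, market_segment)
  )
```

Grouping by product, city, and market_segment gives one point per product-city pair, which matches the three-variable hover text described below.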
# Native plotly with plot_ly()
# Custom hover text combining 3+ variables
plot_ly(
  data = table_data,
  x = ~Avg_Price,
  y = ~Total_Revenue,
  color = ~market_segment,
  type = "scatter",
  mode = "markers",
  marker = list(size = 10, opacity = 0.6),
  text = ~paste(
    "Product:", product,
    "<br>City:", city,
    "<br>Total Revenue: $", format(Total_Revenue, big.mark = ",")
  ),
  hoverinfo = "text"
) |>
  layout(
    title = "Interactive Product Performance Matrix",
    xaxis = list(title = "Average Unit Price ($)"),
    yaxis = list(title = "Total Revenue ($)")
  )
Figure 12: Product price vs revenue by market segment (hover for details).
Note
Research Insight: Value Matrix
The Question: Is there a clear “break point” where price starts to impact total revenue?
The Finding: This scatter plot shows a “revenue cluster” for items priced between $600 and $1,700, indicating that our “Premium” strategy is successfully capturing high-value transactions.
Tip
Interactive Discovery
Hover over the outliers on the far right of the x-axis to identify the specific high-end products that drive our top-tier revenue despite having lower sales frequency.
Important
Multi-Variable Context
The hover text for each point displays three key variables: Product Name, City, and Total Revenue, allowing for a three-dimensional understanding of performance in a single view.
Question 1: Does product pricing significantly impact customer purchasing volume?
Yes, though the effect is modest. The analysis shows a weak negative correlation of -0.148 between unit price and quantity ordered. As the Quantity Scatter Plot (Figure 3) shows, high-value items are almost exclusively purchased as single units, whereas budget items drive higher multi-unit transaction volume.
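The reported correlation can be reproduced with a one-line check. This is a sketch: the column names `unit_price` and `quantity` in `sales_analysis` are assumed from the earlier cleaning steps.

```r
# Pearson correlation between unit price and quantity ordered
# (column names assumed from the analysis; adjust to your data)
cor(sales_analysis$unit_price, sales_analysis$quantity,
    use = "complete.obs", method = "pearson")
```

Using `use = "complete.obs"` drops any rows with missing values so the coefficient matches what the cleaned data would produce.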
Question 2: Which geographic regions are the primary drivers for premium revenue?
The data shows that coastal tech hubs are the dominant force. According to the City Revenue Bar Chart (Figure 2), San Francisco leads with total revenue of $8,262,204, driven largely by the Premium segment. Smaller inland cities show significantly lower willingness to pay for high-end hardware.
Question 3: How does seasonality affect different market tiers?
Seasonality is most pronounced in the Premium tier. While the Budget segment remains stable, the Faceted Line Chart (Figure 4) reveals a massive revenue spike in the Premium segment during the final quarter, likely due to holiday gifting and year-end corporate spending.
Geographic Targeting: Reallocate 20% of the marketing budget from lower-performing cities like Austin or Portland toward San Francisco and NYC specifically for Premium product promotions.
Inventory Optimization: Implement a “Just-in-Time” inventory model for Budget accessories (cables/batteries) during Q1-Q3 to reduce holding costs, as demand is highly predictable and flat.
Bundling Strategy: Since the correlation between price and quantity is weak (-0.148), the store should bundle Premium items with Budget accessories (e.g., a MacBook + USB-C Hub) to increase the “Units Per Transaction” in the high-value segment.
While the analysis is robust, it has the following limitations:
Time Constraint: The dataset is limited to a single year (2019), which prevents me from distinguishing between “Yearly Growth” and “Standard Seasonality.”
Missing Demographics: We lack customer age or gender data, meaning we cannot build detailed persona-based marketing strategies.
External Factors: The analysis does not account for external market benchmarks or competitor pricing, which likely influenced the observed sales trends.
Methodology Reflection: Choosing a Boxplot over a Histogram for Section 5 was a critical choice. The extreme price skewness made traditional distribution plots unreadable, and the boxplot allowed for a much cleaner comparison of price “spread” across tiers.
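The boxplot comparison described above can be sketched as follows. This is illustrative only: it assumes the `market_segment` and `unit_price` columns from earlier sections, and the log scale is one way to tame the heavy right skew mentioned in the reflection.

```r
# Sketch: price spread by tier; a log10 x-axis handles the extreme skew
library(tidyverse)

sales_analysis |>
  ggplot(aes(x = unit_price, y = market_segment)) +
  geom_boxplot(outlier.alpha = 0.2) +
  scale_x_log10(labels = scales::dollar) +
  labs(title = "Unit Price Spread by Market Tier",
       x = "Unit Price (log scale)", y = NULL) +
  theme_minimal()
```

On a linear axis the Premium outliers compress the Budget and Mid-Range boxes into unreadable slivers; the log scale keeps all three spreads comparable.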
Future studies should incorporate Customer Lifetime Value (CLV) modeling. By joining this sales data with customer support or return data, we could identify which segments are not only high-revenue but also high-profit after accounting for post-sale costs.
AI Usage Log
I used Gemini 2.0 Flash as a coding assistant throughout this project to help with technical challenges and document formatting. At the start, I asked Gemini to help me understand specific R error messages, such as the “incomplete string” error in my inline code. I also asked Gemini to debug my custom function for calculating city revenue because I was having trouble with the function scope. When I moved to the visualization sections, I asked Gemini for syntax assistance to use the native pipe |> and the .by argument instead of group_by().
To improve the report’s structure, I asked Gemini to explain how to use Quarto features like callout blocks and tabsets. For the final bonus section, I asked Gemini to help me make my presentation more beautiful by suggesting an AI-themed layout with columns. Every time the AI provided code, I modified the labels and variable names to match my sales_analysis dataset and tested each chunk to ensure it ran without errors. I made sure to manually check all the mathematical outputs, like the San Francisco revenue total, against my raw data to ensure accuracy before final rendering.
Presentation
RevealJS Presentation Structure
Slide 1: Title Slide
Electronic Store Sales Analysis
Strategic Analysis of Electronics Sales
Data-Driven Insights for Marketing Reallocation
Author: Hein Htet SOE THAN
Institution: NEOMA Business School
Date: February 2026
Slide 2: Business Objectives
Core Research Pillars
Primary Goal: Identify high-value geographic hubs to optimize marketing spend.
Market Dynamics: Analyze price elasticity and volume trends across diverse product segments.
Seasonality: Determine peak periods for premium hardware acquisition.
Slide 3: Data Methodology
Scale: Processed 185,000+ raw transaction rows from 12 monthly datasets with full validation.
Cleaning: Handled missing values and standardized data types for time-series analysis.
Feature Engineering: Segmented products into Budget, Mid-Range, and Premium based on unit price distribution.
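The segmentation step can be sketched with a `case_when()` on unit price. The $300 and $600 cut points below are hypothetical placeholders, not the exact thresholds derived from the price distribution in the report.

```r
# Sketch of the tiering step; cut points are illustrative only
library(dplyr)

sales_analysis <- sales_analysis |>
  mutate(market_segment = case_when(
    unit_price < 300 ~ "Budget",
    unit_price < 600 ~ "Mid-Range",
    .default = "Premium"
  ))
```

Because `case_when()` evaluates conditions in order, each row falls into the first matching tier, with `.default` catching everything above the last threshold.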